Motivation

This year, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. Contrasting to popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited.

For example, I and my fellow classamate often go out to McDonald that are near our home for convenience, even though these restaurants were not healthy. Environment can affect our health behaviours imperceptibly.

This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.

Initial questions

  1. Can the percentage of nation-wide chain and fast-food restaurants in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?

  2. Can the percentage of fast-food restaurant (defined by cuisine description in the restaurant inspection dataset) in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate ?

  3. Can the percentage of health inspection grade A restaurant in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?

  4. Can the composite restaurant health score of different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?

For the first three questions, we stick to it. Since the percentage of health inspection grade A restaurant is not a significant predictor, we later put it into the model as a confounder. Moreover, we deleted the fourth problem. Composite restaurant health score were to subjective and difficult to assess. Moreover, we calculated the boro level model first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we still managed to get the neighborhood information and analyze the data accordingly.

Data and Methods

Data Source and Collection

1. NYC restaurant inspection:

https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

This dataset contains the data for restaurant inspection in NYC from August 1, 2014 to December 5th, 2017. Every row is a restaurant inspection record that includes the name of the restaurant, zipcode, cuisine type description, inspection grade (A as the best grade) and so on. Since this dataset has the inspection data for 3 years, it is almost certained that it has the data for all the restuarants in NYC in 2014.

Variable used in this datasets

DBA: Name of the restaurant
BORO: name of the boro
ZIPCODE: the zipcode of the restaurant
CUISINE DESCRIPTION: the kind of food that the restaurant is providing
GRADE: Inspection grade for the specific inspection.

2. 2015 Community Health Profiles Open Data:

https://www1.nyc.gov/site/doh/data/data-publications/profiles.page

This dataset contains NYC every neighborhoods’ demographic (percentage of white race, poverty percentage), health (age-adjusted percent of adult exercised in the last 30 days, age-adjusted percent of adults as a smoker) and our main outcome (Age-adjusted percent of adults that is obese (BMI of 30 or gthe reater), ge-adjusted percent of adults, Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults)

Variable used in this datasets

Name: the name for the neigborhood.
Racewhite_Rate: the percentage for white race
Poverty: Percent of individuals living below the federal poverty threshold
Smoking: age-adjusted percent of adults as a smoker Exercise: Age-adjusted percent of adults that reported getting any exercise in the last 30 days
Obesity: Age-adjusted percent of adults that is obese (BMI of 30 or greater) based on self-reported height and weight
Diabetes: Age-adjusted percent of adults that had ever been told by a healthcare professional that they have diabetes
Stroke_Hosp: Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults

3. Web scraping for what zipcodes each neighborhood contains

Because Community Health Profiles Open Data is has only neighborhood level information, so we have to aggregate neighborhood level information from the restaurant inspeciton data. However, the restaurant inspection data was using zipcodes for every restaurant. Therefore, we had to add the neighborhood name to every restaurant, so that we can group by neighborhood then calculate the percentage for every neighborhood.

First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations were different from the health profile data we downloaded. We cannot merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood was not divided by zipcode areas. Then we tried the hardest way: search the name of the neighborhood on Google and finding the matching zipcodes. It worked out well, but it was not precise, because neighborhoods were not divided according to the area of the zipcode. Different neighborhoods have the same zipcodes.

In order to be unbiased in this process, if two neighborhoods contains the same zipcode area, we will randomly assign this zipcode to only one neighborhood.

Data Import and Cleaning

First, we download restaurant inspection data from NYC open data.

get_all_inspections = function(url) {
  
  all_inspections = vector("list", length = 0)
  
  loop_index = 1
  chunk_size = 50000
  DO_NEXT = TRUE
  
  while (DO_NEXT) {
    message("Getting data, page ", loop_index)
    
    all_inspections[[loop_index]] = 
      GET(url,
          query = list(`$order` = "zipcode",
                       `$limit` = chunk_size,
                       `$offset` = as.integer((loop_index - 1) * chunk_size)
                       )
          ) %>%
      content("text") %>%
      fromJSON() %>%
      as_tibble()
    
    DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
    loop_index = loop_index + 1
  }
  
  all_inspections
  
}

url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"

rest_inspection = get_all_inspections(url) %>%
  bind_rows() 

Then, we download health and demographic data CSV into local folder, read in and clean it.

download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>% 
  select(Name, Racewhite_Rate, Poverty, Unemployment,
         Smoking, Exercise,
         Obesity, Diabetes, Stroke_Hosp) %>% 
  clean_names() 

Lastly, we match the neighborhood names with restaurant zipcodes.

zip_neighbor <- read_csv("neigh_zipcode.csv") %>% 
  mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>% 
  filter(!is.na(neighborhood))

Exploratory Analysis

Health Data

health_only_neighborhood <- health[-c(1:6),] %>% 
  rename(neighborhood = name) %>% 
  mutate(neighborhood = as.factor(neighborhood))
##Plotting for outcome in different neighborhood


bar_obe <- health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>% 
  filter(obesity > 25) %>% 
  ggplot(aes(x = neighborhood,y = obesity, fill = neighborhood)) + geom_col() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 25 percent more obesity rate")


bar_dia <- health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>% 
  filter(diabetes > 10) %>% 
  ggplot(aes(x = neighborhood,y = diabetes, fill = neighborhood)) + geom_col() +   
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 10 percent more diabetes rate")


bar_stro <- health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>% 
  filter(stroke_hosp > 300) %>% 
  ggplot(aes(x = neighborhood,y = stroke_hosp, fill = neighborhood)) + 
  geom_col() +    
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 300 more stroke hospitalization in 100,000 adults")

ggplotly(bar_obe)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(bar_dia)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(bar_stro)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Williamsbridge and Baychester have the highest obesity rate, up to 32%. East New York and Starrett City have the highest diabetes rate, up to 17 percent. BUshwick has the highest stroke hospitalization in 100,000 adults, up to 300 adults. Morrisania and Crotona has the highest percent of poverty rate, up to 40%. Fiancial district has the highest percent of young peoeple, up to 67%.

Restaurant data

Fastfood restaurants by cuisine type

Fastfood restaurants are identified by their cuisine descriptions given in the inspection data. We print out the cuisine type list (n=85) and let everyone circle the ones they think is fastfood and we use the union as our rule (use union because it’s more conservative).

We classify cuisine descriptions “Bagels/Pretzels”, “Bottled beverages, including water, sodas, juices, etc.”, “Chicken”, “Delicatessen”, “Donuts”, “Hamburgers”, “Hotdogs”, “Hotdogs/Pretzels”, “Ice Cream, Gelato, Yogurt, Ices”, “Nuts/Confectionary”, “Pancakes/Waffles”, “Pizza”, “Soul Food”, “Sandwiches”, “Sandwiches/Salads/Mixed Buffet” and “Soups & Sandwiches” as fastfood restaurants. And then we calculate the total number of restaurants…

# calculating the total number of restaurants and the number of fastfood restaurants in the neighborhood, as well as the percentage of fastfood restaurants.

neighborhood_list = 
  rest_neighborhood %>%
  distinct(neighborhood) %>%
  arrange(neighborhood)
  
rest_fastfood_neighborhood = 
  rest_neighborhood %>%
  filter(cuisine_description %in% c("Bagels/Pretzels",
                                    "Bottled beverages, including water, sodas, juices, etc.",
                                    "Chicken",
                                    "Delicatessen",
                                    "Donuts",
                                    "Hamburgers",
                                    "Hotdogs",
                                    "Hotdogs/Pretzels",
                                    "Ice Cream, Gelato, Yogurt, Ices",
                                    "Nuts/Confectionary",
                                    "Pancakes/Waffles",
                                    "Pizza",
                                    "Soul Food",
                                    "Sandwiches",
                                    "Sandwiches/Salads/Mixed Buffet",
                                    "Soups & Sandwiches"))

percent_fastfood_neighborhood = function(name_neighborhood){

  rest_each_neighborhood =
    rest_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis)
  
  n_rest_neighborhood = nrow(rest_each_neighborhood)

  rest_fastfood_distinct_neighborhood = 
    rest_fastfood_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis, cuisine_description)
  
  n_fastfood_neighborhood = nrow(rest_fastfood_distinct_neighborhood)
    
  percent_fastfood_neighborhood = n_fastfood_neighborhood/n_rest_neighborhood
  
  tibble(
    neighborhood = name_neighborhood,
    n_fastfood = n_fastfood_neighborhood,
    n_rest = n_rest_neighborhood,
    percent_fastfood = percent_fastfood_neighborhood
  )
}

fastfood_neighborhood = 
  map(neighborhood_list$neighborhood, percent_fastfood_neighborhood) %>%
  bind_rows() %>%
  mutate(neighborhood = str_to_upper(neighborhood))
fastfood_neighborhood %>%
  mutate(neighborhood = as.factor(neighborhood),
         n_rest = as.numeric(n_rest),
         n_nonfastfood = n_rest - n_fastfood,
         neighborhood = fct_reorder(neighborhood, percent_fastfood)) %>%
  plot_ly(., x = ~neighborhood, y = ~n_fastfood, type = 'bar', name = 'fastfood restaurants') %>%
  add_trace(y = ~n_nonfastfood, name = 'non-fastfood restaurants') %>%
  layout(yaxis = list(title = 'number of restaurants'), 
         xaxis = list(title = 'neighborhood (ordered by percentage of fastfood restaurants)',
                      tickangle = 45),
         barmode = 'stack')

From the plot, we can see that, while the Greenwich Village and SOHO neighborhood has fairly large number of restaurants, it has the smallest percentage of fastfood restaurants. Williamsbridge and Baychester has the largest percentage of fastfood restaurants.

When large number of total restaurants is not equal to large percentage of fastfood restaurants in that neighborhood, we can conclude that the distribution of fastfood restaurants is not even across neighborhoods, which also implies the motivation of our study, we want to investigate if this uneven distribution of fastfood restaurants is associated with diffferent level of chronic disease outcomes within a neighborhood.

Restaurant Chains

We scrape list of 75 national chain restaurants from the wikipedia page (https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual) now I will join this dataset with restaurant inspection data to choose only the chain restaurants in NYC.

chains_html = read_html("https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual")

chain_rest = chains_html %>%
  html_nodes("ul:nth-child(9) li , .jquery-tablesorter tr:nth-child(1) td:nth-child(1)") %>%
  html_text() %>%
  as.tibble() %>%
  mutate(dba = value, 
         dba = str_to_upper(dba)) %>%
  select(dba)
# read in the list of chain restaurants in us 
# made the names to uppercase and changed the var name to dba
head(chain_rest, 10)
## # A tibble: 10 x 1
##                       dba
##                     <chr>
##  1        A&W RESTAURANTS
##  2             APPLEBEE'S
##  3             BAJA FRESH
##  4          BOSTON MARKET
##  5     BUFFALO WILD WINGS
##  6                CHILI'S
##  7 CHIPOTLE MEXICAN GRILL
##  8           CICI'S PIZZA
##  9    COLD STONE CREAMERY
## 10     CORNER BAKERY CAFE

Then, we match list of chain restaurants in U.S. and restaurant inspection data.

# removing punctuations in chain_rest & restaurant inspections (neighborhoods)
chain_rest_str = 
  chain_rest  %>%
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

rest_neigh_str = rest_neighborhood %>% 
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

# Matching the two datasets(restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)

neighborhood_chain = 
  right_join(rest_neigh_str, chain_rest_str) %>%
  filter(!is.na(camis)) %>%
  distinct(camis, dba, neighborhood, boro)
## Joining, by = "dba"

neighborhood_chain %>%
  group_by(dba) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
##                       dba     n
##                     <chr> <int>
##  1          DUNKIN DONUTS   453
##  2                 SUBWAY   364
##  3              STARBUCKS   307
##  4              MCDONALDS   216
##  5 CHIPOTLE MEXICAN GRILL    78
##  6                 WENDYS    46
##  7              APPLEBEES    28
##  8          BOSTON MARKET    24
##  9           WHITE CASTLE    23
## 10              PIZZA HUT    21

The combined dataframe, neighborhood_chain has 1739 observations. Also, there were 33 different chain restaurants extracted. The chart is showing the 10 chain restaurants in descending order.

# counting chains in neighborhoods

neigh_count_chain = neighborhood_chain %>%
  group_by(neighborhood, boro) %>%
  summarise(chain_n = n())

neigh_count_rest = rest_neighborhood %>%
  distinct(neighborhood, camis) %>%
  group_by(neighborhood) %>%
  summarise(res_n = n())

# calculating percentage of chains in each boro
percent_neighborhood_chain = left_join(neigh_count_chain, neigh_count_rest) %>%
  ungroup() %>%
  mutate(chain_percentage = chain_n/res_n,
         neighborhood = str_to_upper(neighborhood))
## Joining, by = "neighborhood"

plot_chain_neighbor = percent_neighborhood_chain %>%
  mutate(neighborhood = forcats::fct_reorder(neighborhood, chain_percentage)) %>%
  ggplot(aes(neighborhood, chain_percentage, fill = boro)) + geom_bar(stat="identity") 

ggplotly(plot_chain_neighbor)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

We have plotted neighborhoods of NYC against their percentage of chain restaurants and grouped them by borough. We see that all the neighborhodds that had a percentage of chain restuarants less than 5% were all part of Brooklyn except Greenwhich village and Soho of Manhattan. Neighborhoods in Queens and Manhattan are spread out across low to high percentage of chain restaurants while most of neighborhoods in Bronx and Staten Island have high percentages. The neighbor hood that had the highest percentage of chain restaurants were Throgs neck and Co-op City of Bronx with 16.5% of chain restaurats out of all restaurats.

Inspection Grade

gradea_neighborhood =
  rest_neighborhood %>%
  group_by(neighborhood, grade) %>%
  summarise(n = n()) %>%
  mutate(grade_percent = n / sum(n)) %>% 
  filter(grade == "A") %>% 
  ungroup(boro) %>%
  mutate(neighborhood = str_to_upper(neighborhood))

gradea_neighborhood %>% 
  mutate(neighborhood = as.factor(neighborhood),
         neighborhood = fct_reorder(neighborhood,grade_percent)) %>% 
  plot_ly(x = ~neighborhood, y = ~grade_percent, color = ~neighborhood, type = "bar") %>%
  layout(showlegend = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The difference of percentage of “grade-A” restaurants in each neighborhood is observed. “Throgs Neck and Co-op City” has the greatest grade-A restaurant percentage, around 44.50%. “Sunset Park”, however, has the least, around 30.62%. “Washington Heights”, where we live, takes the fourth from the end, 34.39%, which is obviously consistent with the feeling we have towards the restaurant condition of “Washington Heights”.

Formal Analysis and Findings

Model Selection Process

health_neighborhood = 
  health %>%
  mutate(neighborhood = str_to_upper(name)) %>%
  select(-name)

combined_chain = 
  percent_neighborhood_chain %>%
  select(neighborhood, chain_percentage)
combined_chain_fastfood = 
  fastfood_neighborhood %>%
  mutate(fastfood_percent = percent_fastfood) %>%
  select(neighborhood, fastfood_percent) %>%
  right_join(combined_chain, by = "neighborhood")
combined_chain_fastfood_gradea = 
  gradea_neighborhood %>%
  select(neighborhood, grade_percent) %>%
  right_join(combined_chain_fastfood, by = "neighborhood")

combined_model =  
  left_join(combined_chain_fastfood_gradea, health_neighborhood, by = "neighborhood")

We originally have two main predictors of interest, percentage of chain restaurants and percentage of fastfood restaurants, and they are both significantly associated with the three chronic disease health oucomes when solely in the model after adjusting for other potential confounders. Taking obesity as example, the p-value for fastfood_percent is 0.0006801903 and for chain_percentage is 0.012419514.

outcome_name = combined_model[,10:12]

main_predictor_selection = function(outcome){

lm_fastfood =   
  lm(outcome ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() 

lm_chain =   
  lm(outcome ~ chain_percentage + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() 

lm_both =   
  lm(outcome ~ fastfood_percent + chain_percentage + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() 

lm_compare = bind_rows(lm_fastfood, lm_chain, lm_both)
}

map(outcome_name, main_predictor_selection)
## $obesity
##                term     estimate   std.error  statistic      p.value
## 1       (Intercept)  31.92112433 15.39841618  2.0730135 0.0431432532
## 2  fastfood_percent  59.42872810 16.44420636  3.6139615 0.0006801903
## 3     grade_percent -28.28758943 27.92319074 -1.0130500 0.3157269432
## 4    racewhite_rate  -0.09256862  0.04288468 -2.1585476 0.0355225376
## 5           poverty   0.05144636  0.09563734  0.5379318 0.5929191610
## 6           smoking   0.76326783  0.24626934  3.0993214 0.0031276157
## 7          exercise  -0.23249722  0.14391114 -1.6155610 0.1122411503
## 8       (Intercept)  26.92389210 16.12679737  1.6695126 0.1010260078
## 9  chain_percentage  68.24399981 26.34821588  2.5900805 0.0124195142
## 10    grade_percent  -6.19002972 27.74920917 -0.2230705 0.8243546127
## 11   racewhite_rate  -0.13087817  0.04186736 -3.1260190 0.0028985047
## 12          poverty   0.13579730  0.10658678  1.2740539 0.2083092151
## 13          smoking   0.83024365  0.25754271  3.2237124 0.0021876241
## 14         exercise  -0.22043017  0.15328119 -1.4380771 0.1564035643
## 15      (Intercept)  31.44610974 15.55231111  2.0219573 0.0484418351
## 16 fastfood_percent  52.82345911 22.03386102  2.3973764 0.0202135830
## 17 chain_percentage  15.25392784 33.53667726  0.4548431 0.6511523758
## 18    grade_percent -29.75987008 28.32416882 -1.0506882 0.2983551020
## 19   racewhite_rate  -0.09200984  0.04323297 -2.1282332 0.0381710626
## 20          poverty   0.07102154  0.10554794  0.6728842 0.5040586289
## 21          smoking   0.76030442  0.24825469  3.0625984 0.0034987282
## 22         exercise  -0.22233682  0.14673168 -1.5152612 0.1358801114
## 
## $diabetes
##                term    estimate   std.error  statistic      p.value
## 1       (Intercept) 17.17647608  6.47822240  2.6514181 1.059925e-02
## 2  fastfood_percent 16.91910979  6.91819372  2.4455964 1.788278e-02
## 3     grade_percent  1.69749345 11.74748349  0.1444985 8.856655e-01
## 4    racewhite_rate -0.07874514  0.01804189 -4.3645736 6.073260e-05
## 5           poverty  0.04837026  0.04023530  1.2021846 2.347381e-01
## 6           smoking  0.15236099  0.10360725  1.4705630 1.474349e-01
## 7          exercise -0.14495032  0.06054443 -2.3941147 2.030169e-02
## 8       (Intercept) 15.74127215  6.62715596  2.3752681 2.125819e-02
## 9  chain_percentage 18.23537394 10.82755192  1.6841641 9.814418e-02
## 10    grade_percent  8.59989177 11.40327696  0.7541597 4.541551e-01
## 11   racewhite_rate -0.09040128  0.01720500 -5.2543608 2.811443e-06
## 12          poverty  0.07079734  0.04380084  1.6163468 1.120708e-01
## 13          smoking  0.17304831  0.10583476  1.6350801 1.080722e-01
## 14         exercise -0.14288729  0.06298947 -2.2684315 2.748259e-02
## 15      (Intercept) 17.10694576  6.55462106  2.6099061 1.185888e-02
## 16 fastfood_percent 15.95226281  9.28631176  1.7178255 9.189529e-02
## 17 chain_percentage  2.23279537 14.13424728  0.1579706 8.751038e-01
## 18    grade_percent  1.48198821 11.93740224  0.1241466 9.016872e-01
## 19   racewhite_rate -0.07866335  0.01822081 -4.3172254 7.285939e-05
## 20          poverty  0.05123558  0.04448385  1.1517793 2.547826e-01
## 21          smoking  0.15192723  0.10462853  1.4520631 1.526087e-01
## 22         exercise -0.14346309  0.06184101 -2.3198699 2.438747e-02
## 
## $stroke_hosp
##                term     estimate   std.error   statistic      p.value
## 1       (Intercept)  128.7330288 159.5106855  0.80704956 4.233141e-01
## 2  fastfood_percent  573.5406491 170.3439236  3.36695690 1.436183e-03
## 3     grade_percent -414.8614240 289.2535987 -1.43424810 1.574881e-01
## 4    racewhite_rate   -1.3739246   0.4442382 -3.09276575 3.186416e-03
## 5           poverty    1.5643605   0.9906978  1.57904910 1.203893e-01
## 6           smoking    6.6038469   2.5510800  2.58864751 1.246526e-02
## 7          exercise    1.9095755   1.4907614  1.28093973 2.058984e-01
## 8       (Intercept)   76.3103399 173.7569200  0.43917871 6.623517e-01
## 9  chain_percentage  258.5291531 283.8867962  0.91067692 3.666679e-01
## 10    grade_percent    3.3420325 298.9816890  0.01117805 9.911242e-01
## 11   racewhite_rate   -1.9949362   0.4510966 -4.42241412 5.003035e-05
## 12          poverty    1.8462919   1.1484110  1.60769264 1.139582e-01
## 13          smoking    7.7931853   2.7748739  2.80848273 6.994899e-03
## 14         exercise    1.5659342   1.6515163  0.94817972 3.474238e-01
## 15      (Intercept)  146.1759388 157.2801408  0.92939858 3.570609e-01
## 16 fastfood_percent  816.0913302 222.8278961  3.66242892 5.940958e-04
## 17 chain_percentage -560.1362476 339.1555946 -1.65156128 1.047690e-01
## 18    grade_percent -360.7981184 286.4416246 -1.25958690 2.135524e-01
## 19   racewhite_rate   -1.3944434   0.4372140 -3.18938431 2.438602e-03
## 20          poverty    0.8455446   1.0674037  0.79215066 4.319421e-01
## 21          smoking    6.7126655   2.5105936  2.67373639 1.005194e-02
## 22         exercise    1.5364776   1.4838939  1.03543631 3.053501e-01

Since we also anticipated them to be highly associated with each other, to avoid collinearity, we will need to make decision on which one to keep as the final main predictor. So we put the two predictors in the same model and see how their p-values change. As a result, the fastfood_percent stays significant (p=0.020213583) and the chain_percentage turn insignificant (p=0.651152376).

Thus, we keep percentage of fastfood restaurants (fastfood_percent) as the main predictor.

outcome_name = combined_model[,10:12]

confounder_percent_change = function(outcome){
lm_adjusted = 
  lm(outcome ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  summary() 
lm_adjusted_tbl = 
  as.tibble(lm_adjusted[[4]]) %>%
  clean_names()

lm_crude =
  lm(outcome ~ fastfood_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  summary() 
lm_crude_tbl = 
  as.tibble(lm_crude[[4]]) %>%
  clean_names()

percent_change = (lm_crude_tbl$estimate[2] - lm_adjusted_tbl$estimate[2]) /lm_crude_tbl$estimate[2]

confounder_percent_change = percent_change
}

map(outcome_name, confounder_percent_change)
## $obesity
## [1] -0.1883358
## 
## $diabetes
## [1] 0.03232614
## 
## $stroke_hosp
## [1] -0.3172496

Besides some biologically meaningful covariates (i.e. race, poverty, smoking status, exercise), we also hypothesize the variable percentage of grade A restuarants as a potential confounder in the association between fastfood restaurants percentage and the three chronic disease health outcomes.

Here, we are assessing if grade_percent is a significant confounder statistically. Using the 10% change rule of thumb, we find that after adjusting for percentage of grade A restuarants, the estimates of fastfood_percent change by 18.83% for outcome obsesity, 3.23% for outcome diabetes and 31.72% for outcome stroke hospitalization.

Thus, regarding the final models, we are going to keep grade_percent for obsesity and stroke hospitalization, but take it out and rerun the model for the outcome diabetes.

Final Models and Findings

Model 1: Obesity = fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise

lm(obesity ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 31.9211243 15.3984162 2.0730135 0.0431433
fastfood_percent 59.4287281 16.4442064 3.6139615 0.0006802
grade_percent -28.2875894 27.9231907 -1.0130500 0.3157269
racewhite_rate -0.0925686 0.0428847 -2.1585476 0.0355225
poverty 0.0514464 0.0956373 0.5379318 0.5929192
smoking 0.7632678 0.2462693 3.0993214 0.0031276
exercise -0.2324972 0.1439111 -1.6155610 0.1122412

Model 2: Diabetes = fastfood_percent + racewhite_rate + poverty + smoking + exercise

lm(diabetes ~ fastfood_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 17.6786699 5.4163283 3.263958 0.0019270
fastfood_percent 17.4843100 5.6533466 3.092736 0.0031611
racewhite_rate -0.0778464 0.0167787 -4.639603 0.0000233
poverty 0.0469398 0.0386366 1.214906 0.2297869
smoking 0.1529457 0.1025675 1.491172 0.1418447
exercise -0.1443873 0.0598582 -2.412154 0.0193531

Model 3: Stroke_hosp = fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise

lm(stroke_hosp ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 128.733029 159.5106855 0.8070496 0.4233141
fastfood_percent 573.540649 170.3439236 3.3669569 0.0014362
grade_percent -414.861424 289.2535987 -1.4342481 0.1574881
racewhite_rate -1.373925 0.4442382 -3.0927657 0.0031864
poverty 1.564361 0.9906978 1.5790491 0.1203893
smoking 6.603847 2.5510800 2.5886475 0.0124653
exercise 1.909576 1.4907614 1.2809397 0.2058984

We have three final models each having a health outcome (prevalence of obesity, prevalence of diabetes, and stroke hospitalization rate) as the dependent variable and percentage of fastfood as the main predictor. Obesity and percentage of fastfood restuarants model gave the most significant results with p-value of the main predictor being 0.0005364 (<0.01). In other words, at significance level of 1%, every 0.1 percent increase in fastfood restaurants in a neighborhood is associated with 5% increase in the neighborhood’s obesity prevalence, while adjusting for other factors.

The p-value of the regression coefficients for model with outcomes as diabetes/stroke were both arround 0.003 (<0.01), indicating there is a significant linear relationship between diabetese/stroke and percentage of fastfood restaurants at 1% significance level. To put in other words, at significance level of 1%, every 1 percent increase in fastfood restaurants in a neighborhood is associated with 17.5% increase in the neighborhood’s diabetes prevalence, while adjusting for other factors. Furthermore, every 1 percent increase in fastfood restaurants in a neighborhood is associated with increase in 573.5 stroke hospitalizations per 100,000 adults in the neighborhood, while adjusting for other factors.

It is reasonable that among the three health outcomes, obesity has the strongest linear association with food environment, in this case, percentage of fastfood restaurants. It is a health condition that has the most direct relation with one’s diet. Moreover, it has a higher prevalence than diabetes and stroke, which could lead to having a lower p-value than the other two health conditions.

Discussion

Our 2015 Community Health Profiles Open Data has data on obesity, diabetes prevalence rate from 2013 and stroke hospitalization rate from 2012. And the data of geographic distribution of restaurants is from 2014 to 2017. So the health profile data is somewhat earlier than the restaurant geographic distribution. However, as restaurant distribution and chronic disease prevalence rate wouldn’t change much over a few years, the lag between years don’t actually affect our results that much. Therefore, our results can still be valid.

Although the cuisine type of a restaurant is typically assumed to directly reflect the health level of the food the restaurant provides, we demonstrated here that even the unhealthiest restaurant can offer one or several healthy food, which may bring bias to the research. This was not a low-probability event, it should have required more efficient planning to resolve. Besides, the study at first treats chain restaurants as a cutoff point for unhealthy restaurants. Obviously this cutoff point is ambiguous and farfetched. In the end, the study doesn’t include this as a predictor.

Lastly, although there is an association between fast-food restaurant distribution and health outcomes, we could not conclude that it is fast-food restaurant causing these health outcomes. Maybe it is because the citizens living in the neighborhood are obese indicating that they like to eat fast food. This will actually result in more fast-food restaurant opening in this neighborhood since the business will be good. Further study need to be conducted before concluding any causation effect.

Conclusion

Overall, we conclude that there is a significant relationship between obesity/diabetes/stroke and the geographical distribution of fast-food restaurants.